Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


the centroid sequence (C) and skew of its abundance ($a_M$) with respect to the centroid sequence abundance ($a_C$), which is given as

$$\mathrm{skew}(M, C) = \frac{a_M}{a_C} \tag{7.3}$$

When a member unique sequence has both a sufficiently small distance and a sufficiently small skew with respect to the centroid sequence, it is likely an incorrect read of the centroid sequence with d point errors. The maximum skew (β) allowed for a cluster member with d differences from the centroid sequence is given by

$$\beta(d) = \frac{1}{2^{\alpha d + 1}} \tag{7.4}$$

where α is set to 2 by default.

Note that as the distance d between the member sequence and the centroid increases, the maximum skew β decreases exponentially.
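The two quantities above can be sketched in a few lines of Python. This is a minimal illustration of Equations 7.3 and 7.4 only, not the UNOISE2 implementation. The function names, the example abundances, and the use of "less than or equal to" when comparing the skew with β are assumptions made for the sketch.

```python
def skew(member_abundance: int, centroid_abundance: int) -> float:
    """Eq. 7.3: abundance of the member relative to the centroid."""
    return member_abundance / centroid_abundance


def max_skew(d: int, alpha: float = 2.0) -> float:
    """Eq. 7.4: maximum skew allowed for a member with d differences."""
    return 1.0 / 2 ** (alpha * d + 1)


def is_probable_error(member_abundance: int, centroid_abundance: int,
                      d: int, alpha: float = 2.0) -> bool:
    """True when the member's skew does not exceed the maximum for d.

    Assumption: the comparison is inclusive (<=).
    """
    return skew(member_abundance, centroid_abundance) <= max_skew(d, alpha)


# With alpha = 2, the allowed skew shrinks by a factor of 4 per extra difference.
for d in range(1, 4):
    print(d, max_skew(d))  # 1 0.125, 2 0.03125, 3 0.0078125

# Hypothetical abundances: a member seen 10 times next to a centroid seen 1000 times.
print(is_probable_error(10, 1000, d=1))   # True  (skew 0.01 <= 0.125)
print(is_probable_error(200, 1000, d=1))  # False (skew 0.2 > 0.125)
```

With the default α = 2, a member that differs from the centroid at one position is treated as a probable error only if its abundance is at most one eighth of the centroid abundance.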

The unique sequences with low abundance are removed by the UNOISE2 algorithm.

The final products of any of the clustering and denoising methods are a feature table and a list of representative sequences. The feature table provides the feature abundance, that is, the number of times a feature has been observed in a sample. A feature is a unit of observation that can be an OTU or an ASV. The feature table is needed for downstream analyses such as taxonomy assignment, construction of a phylogenetic tree, and diversity analysis.
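As a rough illustration of what a feature table holds, the sketch below counts how often each feature is observed in each sample. The sample names, feature identifiers, and the `build_feature_table` helper are hypothetical toy values chosen for the example. Real pipelines typically store the table in a dedicated format rather than a nested dictionary.

```python
from collections import Counter

# Hypothetical toy input: for each sample, the feature assigned to each read.
reads_by_sample = {
    "sample_1": ["ASV_1", "ASV_1", "ASV_2"],
    "sample_2": ["ASV_2", "ASV_3", "ASV_3", "ASV_3"],
}


def build_feature_table(reads_by_sample):
    """Return {feature: {sample: count}}, with 0 for unobserved features."""
    counts = {sample: Counter(features)
              for sample, features in reads_by_sample.items()}
    all_features = sorted({f for c in counts.values() for f in c})
    # Counter returns 0 for missing keys, so absent features are filled with 0.
    return {f: {sample: counts[sample][f] for sample in counts}
            for f in all_features}


feature_table = build_feature_table(reads_by_sample)
for feature, row in feature_table.items():
    print(feature, row)
```

Each row is a feature (an OTU or an ASV) and each column is a sample, so a cell is the number of times that feature was observed in that sample.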

7.2.3 Taxonomy Assignment

Given a set of representative sequences generated by the above-discussed clustering or denoising methods, the taxonomy assignment step attempts to assign a taxon to each sequence. There are several methods for assigning taxonomy but, in general, they can be categorized into (i) alignment-based methods such as BLAST and VSEARCH and (ii) machine learning methods such as the Ribosomal Database Project (RDP) Classifier. The output of a taxonomy assignment method is a mapping of each representative sequence to a taxon, together with an assignment quality score.

7.2.3.1 Basic Local Alignment Search Tool

The Basic Local Alignment Search Tool or BLAST [11] is a widely used seed-based heuristic sequence search tool whose algorithm is adapted from the Smith-Waterman algorithm for local sequence alignment. When a representative sequence (generated from clustering or denoising) is provided as a search query, BLAST searches a database of sequences with known taxonomy. Rather than relying on the alignment to a single sequence, the taxonomy assignment is based on the consensus of the hits in the reference database that exceed a predetermined percent identity. If the BLAST hits agree on the same taxonomy at a given level, with a consensus greater than a threshold, then the representative sequence is assigned that taxonomy level.
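The consensus step described above can be sketched as follows. This is a simplified illustration that does not run BLAST. It assumes the hits have already been parsed into pairs of percent identity and lineage. The function name, the toy hits, and the default thresholds (`min_identity`, `min_consensus`) are assumptions for the example rather than values prescribed by the text.

```python
from collections import Counter


def consensus_taxonomy(hits, min_identity=97.0, min_consensus=0.51):
    """Assign the deepest lineage supported by a sufficient share of hits.

    hits: list of (percent_identity, [taxon labels from the top rank down]).
    """
    # Keep only hits at or above the percent identity threshold.
    kept = [lineage for identity, lineage in hits if identity >= min_identity]
    if not kept:
        return []

    assigned = []
    depth = max(len(lineage) for lineage in kept)
    for level in range(depth):
        # Count full lineage prefixes so that agreement is on the whole path.
        prefixes = Counter(tuple(lineage[: level + 1])
                           for lineage in kept if len(lineage) > level)
        prefix, count = prefixes.most_common(1)[0]
        if count / len(kept) <= min_consensus:
            break  # consensus too weak at this level; stop descending
        assigned = list(prefix)
    return assigned


# Toy hits for one representative sequence (not real BLAST output).
hits = [
    (99.0, ["Bacteria", "Firmicutes", "Bacilli"]),
    (98.5, ["Bacteria", "Firmicutes", "Clostridia"]),
    (98.0, ["Bacteria", "Firmicutes", "Bacilli"]),
    (90.0, ["Bacteria", "Proteobacteria", "Gammaproteobacteria"]),
]

print(consensus_taxonomy(hits))                     # ['Bacteria', 'Firmicutes', 'Bacilli']
print(consensus_taxonomy(hits, min_consensus=0.7))  # ['Bacteria', 'Firmicutes']
```

In the toy data, the fourth hit is discarded by the identity filter. Two of the three remaining hits agree at the third level, which passes a consensus threshold of 0.51 but not 0.7, so the stricter setting stops the assignment one level higher.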